
use optional dil tensor & move semantics #24


Merged: 2 commits into intel:master on May 26, 2020
Conversation

@pinzhenx (Contributor) commented on May 25, 2020

This PR optimizes the dispatcher for aten ops:

  • optimize out dummy dil tensor creation in ShadeDataContext

We noticed that creating the dummy DIL tensor in ShadeDataContext costs a significant amount of time, so this PR changes the dil_tensor field to an optional (see the sketch after the code example below).

  • use move semantics in gen_aten_tensor_by

This avoids an unnecessary shallow copy of the tensor. Callers now hand the DIL tensor over with std::move:

at::Tensor AtenIpexCPUDev::foo(const at::Tensor& x) {
  dil::tensor y;
  ...
  // Move y into the new aten tensor instead of shallow-copying it.
  return dbl::comm::gen_aten_tensor_by(std::move(y));
}
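To see why the move matters in isolation, here is a self-contained sketch. This is not IPEX code: Handle and wrap are stand-ins for a dil::tensor-like handle and a by-value sink such as gen_aten_tensor_by, where even a "shallow" copy still bumps a shared refcount.

#include <iostream>
#include <memory>
#include <utility>

// Stand-in for dil::tensor: copying the handle is shallow but still
// touches a shared refcount, which is the cost the PR avoids.
struct Handle {
  std::shared_ptr<int> buf = std::make_shared<int>(0);
};

// Taking the handle by value lets the caller choose copy vs. move,
// mirroring the gen_aten_tensor_by call pattern above.
Handle wrap(Handle h) { return h; }

int main() {
  Handle a;
  Handle b = wrap(a);                      // lvalue: one shallow copy
  std::cout << a.buf.use_count() << "\n";  // 2 (a and b share the buffer)
  Handle c = wrap(std::move(a));           // rvalue: moved through, no copy
  std::cout << c.buf.use_count() << "\n";  // still 2 (b and c; a was emptied)
  return 0;
}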
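And for the first bullet, a minimal sketch of the optional-field change. The type and field names below are stand-ins, not the actual ShadeDataContext definition, and the real code may use c10::optional rather than std::optional:

#include <optional>

// Stand-in for dil::tensor with a non-trivial default constructor.
struct DilTensor {
  DilTensor() { /* imagine expensive descriptor setup here */ }
};

// Before: every context default-constructs a dummy DIL tensor,
// even for plain CPU tensors that never need one.
struct ShadeDataContextBefore {
  DilTensor dil_tensor;
};

// After: the field is optional, so the construction cost is paid
// only when a real DIL buffer is attached.
struct ShadeDataContextAfter {
  std::optional<DilTensor> dil_tensor;  // empty by default
};

int main() {
  ShadeDataContextBefore before;  // pays the dummy construction
  ShadeDataContextAfter after;    // pays nothing up front
  after.dil_tensor.emplace();     // construct only when needed
  return 0;
}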

This PR reduces the dispatcher overhead from 90%+ to 20%+. The results were gathered on a single thread with jemalloc enabled. The remaining gap mostly comes from shallowUpgradeToDPCPPTensor, which is where we need to optimize next.

Here's the benchmark script.

import torch
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--ipex", action="store_true", default=False)
parser.add_argument('--num-warmup-runs', type=int, default=10)
parser.add_argument('--num-main-runs', type=int, default=100000)
args = parser.parse_args()

x = torch.rand(1, 1)
y_ref = x.relu()
if args.ipex:
    print("# USE IPEX")
    import _torch_ipex
    _torch_ipex._initialize_aten_bindings()
    x = x.to('dpcpp')
else:
    print("# NO IPEX")


# warm up before profiling
for _ in range(args.num_warmup_runs):
    y = x.relu()

# profile the main loop
with torch.autograd.profiler.profile(True) as prof:
    for _ in range(args.num_main_runs):
        y = x.relu()

print(prof.key_averages().table(sort_by="self_cpu_time_total"))
assert torch.equal(y_ref, y)
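Assuming the script is saved as bench.py (a hypothetical filename), running python bench.py profiles the stock path and python bench.py --ipex profiles the optimized dispatcher.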

@pinzhenx (Contributor, Author) commented

@EikanWang @hongzhen1

@EikanWang (Contributor) commented

LGTM

pinzhenx force-pushed the optional branch 2 times, most recently from af2f809 to 0db0e60 on May 25, 2020 05:38
@EikanWang (Contributor) commented

@hongzhen1 r u okay with this optimization? I will merge this PR first in case some new modifications will break this.

@hongzhen1 (Contributor) commented

> @hongzhen1 r u okay with this optimization? I will merge this PR first in case some new modifications will break this.

LGTM

pinzhenx marked this pull request as draft on May 25, 2020 14:41
pinzhenx marked this pull request as ready for review on May 25, 2020 15:38
EikanWang merged commit a79be1b into intel:master on May 26, 2020
EikanWang pushed a commit that referenced this pull request Oct 4, 2021
* enable fp32 lstm in cpu device

* lstm enable bf16

* Implement unit test

* add gather into black list

* Remove unnecessary lines and move test case position

* hook at module level

* copy _flat_weights into IpexLSTM # model.bias_ih_l0 will be incorrect

* add fp32 unit test

* refactor LSTM UT

* update comments

Co-authored-by: chunyuan <chunyuan.wu@intel.com>